A processor array employs an SIMD architecture and includes a number of sub-arrays (S1...S4). Each
sub-array comprises a number of processor elements, each connected to local store including on-chip
memory. Each chip is connected to a region of off-chip memory by an m-bit wide path, where m is an
integer greater than 1. The path is selectively configurable as a one-bit path to or from each of m
processor elements or as an m-bit wide path arranged to communicate complete m-bit words of memory data
between the region of off-chip memory and respective processor elements.

PROCESSOR ARRAY SYSTEM



The present invention relates to parallel processing computer systems, and in particular to a system 
comprising an array of processor elements employing an SIMD architecture. One example of such a
system is described and claimed in GB-A-1445714, assigned to the present applicants. 

It is known to construct the array of processor elements from a number of sub-arrays or modules
formed on separate chips. Each sub-array comprises a number of processor elements, and data paths are
provided for the communication of data between neighbouring processor elements both within the sub-array
and from one sub-array to adjacent sub-arrays. Each processor element has local store associated with it.
Part of this local store is provided as on-chip memory integrated with the processing elements on the
sub-array. In addition, in order further to increase the memory available to the processor elements without
reducing the levels of integration possible in the chip, off-chip memory is used. The off-chip memory is
physically separate from the chip bearing the respective sub-array but is provided with connections to all 
the processor elements on the sub-array so that each processor element sees a region of the off-chip 
memory as an extension of its local store. 

According to the present invention, in a processor array employing an SIMD architecture, the array
comprising a number of sub-arrays and each sub-array comprising n processor elements, each processor
element being connected to local store comprising on-chip memory, each chip is connected by an m-bit
wide path, where m is an integer greater than 1, to a region of off-chip memory, this path being selectively
configurable as a one-bit path to or from each of m processor elements, or as an m-bit wide path arranged
to communicate complete m-bit words of memory data between the region of off-chip memory and
respective processor elements.

Preferably m is equal to n and each sub-array is formed on a separate chip. 

In conventional processor arrays built from single-bit processors, the region of memory associated with
each individual processor is also one bit wide, so the successive bits of a given data word or number are
held in different locations within that memory region. This is known as 'vertical' storage mode. At a given
time every processor accesses the same bit of the data held in its own memory, this set of bits being
referred to as a memory plane. For example, in a particular instruction each processor may access the sign
bit of a number. However, when the memory of such a processor array is accessed by the MCU, it normally
accesses the memory bits that comprise a particular row of a particular memory plane. Thus a word of data
written by the MCU is said to be in 'horizontal' storage mode. If such data is to be processed by the
processor array it is usually necessary to re-arrange the data into the 'vertical' storage mode described
above, the conversion of data between the two storage modes being referred to as corner turning. Such
corner turning can be performed by an instruction sequence involving shifting and merging of data, but
there is clearly a performance overhead associated with the input of data to the processor array or the
return of results from it.
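
By way of illustration only, corner turning can be modelled as the transposition of a square matrix of bits. The following Python sketch is not taken from the described system; the function name `corner_turn` and the convention that bit j of a word has weight 2^j are assumptions made for the example.

```python
def corner_turn(words, width):
    """Transpose `width` words of `width` bits each.

    Bit j of input word i becomes bit i of output word j, so a set of
    horizontal words (one complete value per word) becomes a set of
    memory planes (one bit of every value per word), and vice versa.
    """
    planes = [0] * width
    for i, word in enumerate(words):
        for j in range(width):
            bit = (word >> j) & 1
            planes[j] |= bit << i
    return planes


# Four 4-bit values held horizontally, one value per word.
horizontal = [0b0001, 0b0011, 0b0111, 0b1111]
# After corner turning, each word holds one bit position of all four values.
vertical = corner_turn(horizontal, 4)
```

Applying the same function to the vertical form restores the horizontal form, which reflects the fact that the conversion is symmetric between the two storage modes.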

When the processors are byte wide, as disclosed in our co-pending European application,
agent's reference 80/3430/02, claiming priority from British application number 8925720.8 and also entitled
Processor Array System, the on-chip memory is also byte wide, and successive bytes of each number are
similarly held at successive byte wide locations in the memory of a given processor. This may be
considered to be a vertical arrangement of bytes, although the bits within each byte are of course accessed
and processed in parallel.

Although, as mentioned above, off-chip memory has been used in the past as an extension of the local
store, the organisation adopted for the off-chip memory has always been the same as that used in the
on-chip memory, with the data stored in vertical mode. The present inventor however has discovered that
significant advantages can be gained by arranging the off-chip memory in horizontal mode with a word
length related to the number of processing elements in the sub-array. Corner turning may then be carried
out in transit between the off-chip memory and the processing element using an n-bit shift register.

With the arrangement of the present invention the MCU, that is the scalar processor which controls
operation of the array, is able to access an element of a matrix stored in off-chip memory with the same
speed as access to a vector element or to a scalar. This provides a significant improvement in the
performance of the system when handling matrices.

Preferably the n-bit wide path is also configurable to communicate words of length n/2 or n/4 between
the off-chip memory and respective processor elements. Preferably each sub-array is arranged to provide a
locally generated address for each word accessed in the off-chip memory.

Preferably the processor array system is connected to a host processor arranged to address the
processor array as an extension of its own memory.

In conventional processor arrays of single-bit processors the memory address is the same for every
processor, as already explained. In the present invention, each n-bit word accessed in off-chip memory
[...]. The chip is provided with means [...] may optionally be computed locally within that chip rather
than being [...]. This is known as 'local indexing', since each processor may thereby supply [...].

Preferably the local address generation means include an on-chip address buffer arranged to store one
address for each processor element.

Preferably each processing element is arranged to construct a local index address in an associated
register and to transfer the address from the associated register to the on-chip buffer to write data to a
locally-indexed position in the off-chip memory. Preferably the associated register is an operand register
connected to the arithmetic unit of the respective processing element.

Alternatively, each sub-array may include an on-chip buffer for off-chip memory, the on-chip buffer
being arranged to hold memory data words associated with respective processor elements and the operand
register of each processor element being arranged to hold an associated local memory address.

Preferably the arithmetic unit is a byte-wide processor and the operand register forms part of a
multi-byte shift network having a data output for each byte position, and the processing element further
includes a multiplexer arranged to communicate data from a selected one of the outputs to the arithmetic
unit. [...] an operand register [...] present in each processing element [...]. The same register may be
used to receive the data words from the off-chip memory, thereby enabling the required functions to be
implemented without any further increase in the complexity of the processing element. The use of a
shift register in conjunction with [...] a byte-wide ALU is described in our above cited co-pending
application.
A processor array in accordance with the present invention will now be described in further detail with
reference to the accompanying drawings in which:

Figure 1 is a block diagram of a processor array system;

Figure 2 is a diagram of a processing element for use in the system of Figure 1;
Figure 3 shows one configuration of memory interface for a single sub-array;
Figure 4 shows an alternative configuration of memory interface; and

Figure 5 shows a configuration for use when two 16 bit data words are packed into each memory word.

The system of the present invention is described below in the context of a processor array system in
which the processing elements include a byte-wide arithmetic unit and a multi-byte shift register. The
system of the present invention is however by no means limited to use with this form of system [...]
an otherwise conventional array system such as [...].

[...] comprises 32 processing elements per [...].

A processing element is shown in Figure 2. The processing element is arranged to
operate on a byte of 8 bits. The processing element includes an 8 bit wide arithmetic unit ALU and [...]
carrying data between the arithmetic unit ALU and [...].

The processing element further includes a four byte wide 32 bit operand shift network Q
comprising a byte-wise shift network Q1, a bit-wise shift network Q2 and an output register Q0. The shift
networks are formed from appropriately interconnected 2:1 multiplexers. The outputs of the output register
Q0 are connected to corresponding inputs of the byte-wise shift network Q1 to provide a cyclical data
path. Another connection from each output of the output register Q0 goes to a multiplexer MUX which is
linked by a byte-wide data path to an input of the arithmetic unit ALU. The data output of the arithmetic
unit ALU is input at one end of the byte-wise shift network Q1 and the single bit carry output of the
arithmetic unit ALU is input at one end of the bit-wise shift network Q2. The data output from [...] Q2 is
taken via a bit-wide data path [...].

The different elements of the processing element PE and examples of the use of the processing
element PE in carrying out arithmetic operations are described in further detail below.



Memory Paths

The on-chip memory is organised as 8 bits wide and a byte may be read or written in one clock cycle, but
not both in the same cycle. In practice it is convenient to have at least 512 bits of on-chip memory per PE.
For clarity, the on-chip memory is omitted from Figure 2. The read port of the on-chip memory is shown at
the top of the figure, and the write port at the bottom.

The PE data paths for the least significant bit are substantially the same as in conventional single-bit
processing elements, such as those described in the above cited patent. However, for a full set of single-bit
operations on the on-chip memory the following additional functions are necessary:

1. On the read port, option to select any bit of the on-chip memory byte and place it on the least
significant bit. The other bits on the PE input can be regarded as undefined.

2. On the write port, option to replicate the least significant bit of the ALU output in all the bits.
Alternatively this replication may be done in the ALU itself.

3. Option to write just one bit of the byte into on-chip memory. Preferably this is done by gating the
writes to individual bits; it may alternatively be done as read-merge-write but this takes longer.

These additional functions listed above are equivalent to providing single-bit access (as well as byte
access) to the on-chip memory. In general the bit address used for accessing the memory is different from
that used for masking the ALU, since for single-bit operation it is always the least significant bit of the ALU
that is used.

A one bit wide path from off-chip memory is multiplexed with the least significant bit of the on-chip
memory, the other bits being "don't care" in this case. Similarly, off-chip memory may be written from
the least significant bit of the on-chip memory write path.
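
The three single-bit functions listed above may be sketched in Python as follows. This is a behavioural illustration under stated assumptions, not a description of the actual circuitry: the function names are hypothetical, bit index 0 is taken to be the least significant bit, and the "undefined" upper bits on the read port are simply returned as zero.

```python
def read_bit_to_lsb(mem_byte, bit):
    """Read port: place the selected bit of the memory byte on the LSB.

    The other bits are undefined in the text; this model returns them as 0.
    """
    return (mem_byte >> bit) & 1


def replicate_lsb(alu_out):
    """Write port: copy the LSB of the ALU output into all eight bits."""
    return 0xFF if alu_out & 1 else 0x00


def write_single_bit(mem_byte, data_byte, bit):
    """Gated write: only the selected bit of the stored byte is changed."""
    gate = 1 << bit
    return (mem_byte & ~gate & 0xFF) | (data_byte & gate)


# Set bit 0 of a stored byte from a single-bit ALU result of 1.
updated = write_single_bit(0b10101010, replicate_lsb(1), 0)
```

Replicating the result bit before the gated write means the same write data can be used whichever bit position is selected.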

Q Register 


As described above, in the present example the Q register is a 32 bit wide register with shift facilities. 
In general the length of the Q register is preferably, but not necessarily, matched to the word length of the 
operands. If, for example, the PE is required to process 48-bit words then Q is preferably at least 48 bits 
long. It operates like a specialised on-chip memory with simultaneous read and write ports. 

It is possible to construct the Q register so that only the least significant byte of the register is used as
an ALU operand but input can be made at any byte position. In practice however it is found to be more
useful to be able to select any byte as ALU operand but to restrict input to one byte of the register. The Q
register may be considered to be of length 8, 16, 24 or 32 bits. When emulating the instruction set of existing
single-bit arrays, the least significant bit of the most significant byte is used.

The current outputs of the Q register pass in succession through two shift networks:

1. An optional right shift of 8 bits (1 byte) with the ALU output fed in at the MS end. It may be useful also
to be able to input the ALU data at the MS byte and pass the other bytes unchanged.

2. An optional right shift of one bit with either the ALU carry output or the value of the register fed in at
the MS end. When this is done, the bit shifted out on the right is available to shift in to the S register, i.e.
the multiplier register.

The output of the second shifter is always clocked into the Q register. The two shifts may be applied 
separately or together and either or both may be global or under local activity control. 
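
The passage of the Q register contents through the two shift networks may be illustrated by the following Python sketch. It is a simplified model under stated assumptions: the name `q_step` and its parameters are hypothetical, only the ALU carry output is modelled as the serial input to the one-bit shift, and local activity control is omitted.

```python
Q_BITS = 32  # length of the Q register in the present example


def q_step(q, alu_out=0, carry=0, byte_shift=False, bit_shift=False):
    """Model one clock of the Q register; returns (new_q, bit_shifted_out)."""
    if byte_shift:
        # First network: right shift of 8 bits, ALU output enters at the MS end.
        q = (q >> 8) | ((alu_out & 0xFF) << (Q_BITS - 8))
    shifted_out = None
    if bit_shift:
        # Second network: right shift of one bit, carry enters at the MS end.
        # The bit leaving on the right is what would be offered to the S register.
        shifted_out = q & 1
        q = (q >> 1) | ((carry & 1) << (Q_BITS - 1))
    return q, shifted_out


# Byte shift only: the ALU output byte 0xAA arrives at the MS end.
after_byte_shift = q_step(0x11223344, alu_out=0xAA, byte_shift=True)
```

Because the byte shift is applied before the bit shift, selecting both in the same cycle moves the ALU output in at the MS end and then shifts the whole register one further place to the right.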

S Register (Multiplier Register)

The S register is an 8-bit shift register used to hold "old" memory contents for read/write. It is also 
available as a programmer-visible register used in particular in multiplication. It may be shifted one bit to 
the right, usually as an extension to the Q register. The least significant bit may be used as a multiplier bit. 



Neighbour Input Multiplexer

Each processing element [...] a data input [...].

The C register is one bit wide and may be loaded from the carry/borrow output of the ALU. It may be used
as carry-in to the ALU or as a serial shift input for the Q register.

ALU

As described above, this is now 8 bits wide, taking a selected byte of the operand register Q
and of the input multiplexer data as input and the C register as carry-in.

A variety of functions are provided. To give maximum flexibility of single-bit functions the least
significant bit has full function units for both sum and carry outputs. For the other bit positions arithmetic
add, subtract and reverse subtract are provided, together with copy of either operand, and a variety of
bit-by-bit boolean functions.

The ALU has masking features which permit operations on selected bit fields rather than being
constrained to byte boundaries. Thus an 8-bit mask, common to all PEs, is applied to the ALU and has the
following effect:

If mask bit = 1: normal generation of carry out bit.

If mask bit = 0: carry out bit = carry in bit for that bit position.

Result bit: if mask bit = 1 the normal result bit function for that bit is provided at the ALU output. If
mask bit = 0 the result value is of no interest, so any value convenient to the implementation may be
provided.

Masking the ALU operation may be implemented by explicitly gating the carry propagation in the 
manner described above, but this may preclude the use of standard carry predict techniques for fast 
operation of the ALU. An alternative is to achieve the masking by forcing one input of the ALU to 1 and the 
other to 0, for each bit position where the mask bit is 0. 

The mask pattern comprises a consecutive set of true bits specified by a start bit and an end bit. This of
course allows selection of a single bit. For operations on a complete byte, the start bit is specified as zero 
and the end bit as 7. For single-bit operations the least significant bit of the ALU is used, so the start and 
end bits are both specified as 7. Thus, the carry out of the least significant bit is propagated unchanged 
through the other ALU bits and may be clocked into C. 
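
The two ways of implementing the mask, explicit gating of the carry and forcing of the ALU inputs, may be compared in the following Python sketch. It is illustrative only. The function names are hypothetical, and the mapping of start and end bit numbers assumes, as the values 0 to 7 for a complete byte and 7 for the least significant bit suggest, that bit 0 is the most significant bit.

```python
def mask_from_range(start, end):
    """Build the 8-bit mask; bit 0 is taken as the MS bit and bit 7 as the LS bit."""
    mask = 0
    for i in range(start, end + 1):
        mask |= 1 << (7 - i)
    return mask


def masked_add(a, b, carry_in, mask):
    """Ripple add in which masked-off positions pass carry-in to carry-out."""
    result, carry = 0, carry_in & 1
    for pos in range(8):  # pos 0 is the LS position in this loop
        x, y = (a >> pos) & 1, (b >> pos) & 1
        if (mask >> pos) & 1:
            result |= (x ^ y ^ carry) << pos
            carry = (x & y) | (carry & (x ^ y))
        # Where the mask bit is 0 the carry is unchanged and the result
        # bit is left at 0, since its value is of no interest.
    return result, carry


def masked_add_forced(a, b, carry_in, mask):
    """Alternative: force one input to 1 and the other to 0 where the mask is 0."""
    a_forced = (a | ~mask) & 0xFF
    b_forced = b & mask
    total = a_forced + b_forced + (carry_in & 1)
    return total & mask, total >> 8


# Single-bit operation on the LS bit: 1 + 1 gives result 0 and carry 1,
# and the carry passes unchanged through the seven masked-off positions.
single_bit = masked_add(0x01, 0x01, 0, mask_from_range(7, 7))
```

With one input forced to 1 and the other to 0, a masked-off position always propagates its carry-in, so an ordinary adder with fast carry logic gives the same carry-out as the explicitly gated version.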

During multiply the LS bit of the S register acts as a multiplier and may achieve this by selecting the
ALU function as either "copy" or "add" dependent on the local value of that bit. There is also provision for 
local control to distinguish between add and subtract, for such purposes as non-restoring division. The LS 
bit of the S register may be used to control this function. 


Merge Function 

This is the logic that selects the memory output data as either the ALU output or the old memory 
contents. It is 8 bits wide and each bit has its own activity select but its function is otherwise the same as in
the single bit systems disclosed in the above cited patents and applications. The merge function gives an 8
bit path to the on-chip memory and as already noted the off-chip memory path is taken from the least 
significant bit. 
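
The per-bit selection performed by the merge function can be expressed compactly. In the Python sketch below the function name `merge` is hypothetical, and the polarity of the activity select (a 1 selecting the ALU output) is an assumption made for the illustration.

```python
def merge(alu_out, old_mem, activity):
    """Per-bit select: activity bit 1 takes the ALU output, 0 keeps the old contents."""
    return (alu_out & activity & 0xFF) | (old_mem & ~activity & 0xFF)


# Only the low four bits are active, so only they take the new value.
merged = merge(0xFF, 0x00, 0x0F)
```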

Neighbour Paths

The neighbour output (not shown in the Figure) needs to be either the least significant bit of a selected
byte of the Q register (for shift functions) or the carry-out of the ALU (for ripple add functions). The
memory path may be common with the neighbour output. The selected neighbour input may be used
instead of the C register as the carry-in to the ALU.



D Plane 

As in the systems described in the above cited patents, a D-register forming a data plane for Fast 
Input-Output operations may be used. The D register can be loaded from memory input, or supplied as 
data on the memory write path. A single bit D register may be used, linked only to the off-chip memory. If 
the on-chip memory is operated as a cache then I/O data can be cached and D plane transfers to or from
the on-chip memory may then be implemented. In the latter case the D register is 8 bits wide.

Example of PE Usage - Unsigned Integer Multiply 

The method for unsigned integer multiplication of 32 bit operands, producing a double length product, is
given below. For this description it is assumed that Q is initially zero. Both operands and the result are
in the on-chip memory.

For each byte of multiplier
    Load multiplier byte into S.
    For each bit of multiplier byte
        For each byte of multiplicand
            Where LS bit of S
                Add LS byte of Q register and multiplicand byte from memory
            Elsewhere
                Copy Q register to ALU output
            Endwhere
            [...]
            [...] bit of S. This puts one completed result bit into S, aligns Q ready for the next
            multiplier bit, and discards the multiplier bit just used, bringing the next multiplier bit
            into the correct position.
        End For
    End For
    Store completed result byte from the S register.
End For
For each byte of MS half of result
    Store result byte from the appropriate byte of Q.

[...] cycles would be needed if the result was to be written under activity control.

This scheme can be applied directly to multipliers of any length, even greater than 32 bits.
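
The arithmetic underlying the method can be checked with the following word-level Python sketch. It is a simplification under stated assumptions rather than a model of the byte-serial data paths: the inner loop over multiplicand bytes is replaced by a single addition into Q, Q is allowed to hold the carry as an extra bit, and the function name `multiply_unsigned` is hypothetical.

```python
def multiply_unsigned(multiplicand, multiplier, nbytes=4):
    """Shift-and-add multiply using a Q accumulator and an 8-bit S register."""
    q = 0
    result_bytes = []
    for k in range(nbytes):
        # Load the next multiplier byte, least significant byte first, into S.
        s = (multiplier >> (8 * k)) & 0xFF
        for _ in range(8):
            if s & 1:
                # Where the LS bit of S is set, add the multiplicand into Q.
                q += multiplicand
            # Shift Q and S right together: the LS bit of Q is a completed
            # result bit and enters S, while the used multiplier bit is lost.
            s = (s >> 1) | ((q & 1) << 7)
            q >>= 1
        # After eight bits S holds one completed byte of the result.
        result_bytes.append(s)
    low_half = sum(byte << (8 * i) for i, byte in enumerate(result_bytes))
    # Q now holds the most significant half of the double length product.
    return (q << (8 * nbytes)) | low_half


product = multiply_unsigned(0xFFFFFFFF, 0xFFFFFFFF)
```

The sketch shows why S can serve both purposes at once: each right shift frees one position at the MS end of S for a result bit as it discards one multiplier bit at the LS end.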

together.

Arbitrary Bit Fields

Operations such as 7 bit addition (for example used [...]) [...] operation. Similarly the ALU result
could be written back to Q and then aligned to the result space required, rather than being written
direct to memory.

Neighbour Operations

For ripple add, a carry is taken from the output of the ALU and brought in as the carry-in to the ALU of
a neighbouring PE. The operands are typically a byte of the Q register and a memory byte. Using masking,
the carry may propagate through any bit field within the ALU. The functionality of previous single-bit arrays
may be modelled by selecting C as the operand at the input multiplexer, and masking all but the least
significant bit of the ALU. For neighbour shifts, the neighbour output is taken from the least significant bit of
the selected byte of the Q register. The neighbour input comes in at the ALU carry input and the ALU
function is such that this value is reproduced at the function output, and thus may be written back to Q or to
A. A variant of this is for the ALU to propagate the carry-in through to the carry-out by appropriate function
setting on the least significant bit and by masking of the other bits; simultaneously the Q register is shifted
one place right with the ALU carry brought in at the MS end. By repeating this operation, any number of bits
in the Q register may be shifted to a neighbouring PE at the rate of one bit per PE per cycle.

Memory organisation 



External Interface 

The details of the processor together with its on-chip memory and the single bit path between each
processor and its memory, as described in the preceding paragraphs, provide a great deal of commonality
with previously known arrays of single bit processors in that the arrangement of data in the off-chip memory
can be the same in the two cases. In particular a boolean matrix can be held in a plane of the off-chip
memory and in one instruction loaded into the processor array, each bit of the boolean matrix being loaded
into the least significant bit of one of the registers in one processor.

To take advantage of the option of having data held horizontally in off-chip memory, further circuitry is
added between the sets of processors (the processor being depicted in Figure 2) and the memory pins.
One such configuration is shown in Figure 3. This shows, as an example, a chip containing 32 processors at
the top of the figure, and the region of memory associated with that processor chip at the bottom of the
figure. The memory data paths are also 32 bits wide. In practice there would be further memory paths for
parity or Error Correcting Codes, dependent on the technology used to implement the memory. This error
handling uses conventional techniques, either implemented as part of the memory or on the processor chip,
and is not considered further here. For convenience the memory is shown as having bi-directional data
paths.

The 32-bit wide data path at the left of Figure 3 takes data into or out of the processor chip. It is
convenient to number the bits in this data path 0 to 31 in the same way as the processors within the chip.
There is a single bit wide connection between bit 0 of the data path and the single bit connections to and
from the off-chip memory associated with processor 0; these connections at the processor are shown in
Figure 2. Similarly data bit 1 is connected to and from processor 1 and so on with data bit 31 connected to
and from processor 31. These connections provide access to vertical mode data in the off-chip memory at
one bit wide per processor, as well as access to boolean matrices in the off-chip memory.

An alternative memory access mode makes a 32 bit memory connection to the Q register of each
processor. For writing to memory all 32 bits of the Q register of a particular processor are gated onto the
memory data path at the same time. Typically the Q register of processor 0 would be output to the memory
in the first memory write cycle, followed by the Q register of processor 1, and so on through to processor
31. Conventional techniques such as multiplexers or tri-state gating logic are used to put the contents of the
chosen Q register onto the data path. Similarly when reading the memory the entire contents of the 32 bit
data path are clocked into the Q register of one of the processors, usually processor 0 followed by
processor 1 and so on. In order to do this it is necessary to have additional multiplexing at the inputs to the
Q registers; this is not shown in Figure 2.
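
The two access modes may be contrasted in the following Python sketch. It is a behavioural illustration only: the function names are hypothetical, the off-chip memory is modelled as a dictionary from address to 32 bit word, and the association of data path bit i with weight 2^i is an assumption.

```python
N = 32  # processors per chip, and width of the memory data path


def read_plane(word):
    """One-bit mode: bit i of the memory word goes to processor i."""
    return [(word >> i) & 1 for i in range(N)]


def read_words(memory, addresses):
    """Word mode read: the k-th memory cycle loads a whole word into the
    Q register of processor k."""
    return [memory[addr] for addr in addresses]


def write_words(memory, addresses, q_registers):
    """Word mode write: Q of processor 0 first, then processor 1, and so on."""
    for addr, q in zip(addresses, q_registers):
        memory[addr] = q & 0xFFFFFFFF


memory = {10: 0xAAAA5555, 11: 0x12345678}
q_values = read_words(memory, [10, 11])   # one complete word per processor
bits = read_plane(memory[10])             # one bit per processor
```

In one-bit mode a single memory cycle serves all 32 processors with one bit each, whereas in word mode 32 cycles are needed but each processor receives a complete word with no corner turning.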

The addressing for the off-chip memory is shown near the bottom of figure 3. Usually the address is 
common to all the regions of memory, and thus to all the processors in the complete array. In this case the 
address multiplexer selects the global address broadcast from the MCU. For local addressing each
processor computes the memory address that it requires, leaving the resulting address in its Q register.
Then the addresses are copied one by one into the address buffer, which is part of the processor chip. This 
buffer may be constructed from a random access memory having 32 words, one word for each processor 
on the chip, each word being wide enough to hold a complete address for the region of memory associated

Memory address    Contents
1000              A(5,1)
1001              A(5,2)
...               ...
1031              A(5,32)
1032              A(5,33)
...               ...
1063              A(5,64)


Each line in the table above represents the address and contents of a 32 bit memory word. The 
addresses are given in decimal and have arbitrarily been shown as starting at 1000. 
The following are examples of patterns of memory access and processing: 

(a) Straightforward processing of the entire matrix may be done in two halves: columns 1 to 32 of the 
data, followed by columns 33 to 64, for example. To load the first set of columns into the processor chip 
the sequence of memory addresses is 1000, 1001, 1002, ... 1031, and this results in processor 0 receiving
data item A(5,1) and so on. After processing (for example taking the square root of the data value), the 
results may be written back to the same locations or to a similar set of locations elsewhere in the 
memory. 

(b) Matrix A can be effectively shifted along the row to line it up with some other matrix, for example to 
perform the following operation over all I and J: 

B(I,J) = B(I,J) + A(I,J+3) (for J < 62)
B(I,J) = B(I,J) + A(I,J+3-64) (for J >= 62)

The first 32 columns of matrix B are read in normally. Then matrix A is read using the following
sequence of addresses: 1003, 1004, 1005, ... 1034.

For processing the second half of matrix B the addresses for accessing matrix A are: 1035, 1036, ... 1063,
1000, 1001, 1002.

Note that this re-arrangement of data along the rows takes no longer than the direct access in the first 
example. 

(c) Local indexing can provide a rapid implementation of the following type of operation: 
B(I,J) = A(I,Z(I,J)) 

In this case Z is an integer matrix taking values in the range 1 to 64. Again this is done in two halves 
corresponding respectively to the first 32 columns of B (or equivalently of Z), and the second 32
columns of B. The set of Z values is loaded into the processors, their values checked as required, then
adjusted by adding 999 everywhere; this is the start address of A, adjusted for the fact that Fortran
indexes start at 1. Then the indexed read is performed as previously described, and the result copied to
matrix B. 

It is thus apparent that this memory organisation provides very fast, or in many cases free, data re- 
organisation when the data re-organisation is local to the region of memory associated with a given 
processor chip. 
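
The address sequences for access patterns (a), (b) and (c) can be generated as in the following Python sketch. The constants follow the example table, with A(5,1) at address 1000 and a row of 64 columns; the function names are hypothetical and are introduced only for this illustration.

```python
BASE = 1000    # address of A(5,1) in the example table
COLUMNS = 64   # number of columns in a row of matrix A
CHIP = 32      # processors per chip


def address_of(column):
    """Address of A(5, column), with columns numbered from 1 as in Fortran."""
    return BASE - 1 + column


def straight(first_column):
    """Pattern (a): 32 consecutive columns starting at first_column."""
    return [address_of(first_column + p) for p in range(CHIP)]


def shifted(first_column, shift):
    """Pattern (b): columns J+shift, wrapping round the end of the row."""
    return [address_of((first_column + p + shift - 1) % COLUMNS + 1)
            for p in range(CHIP)]


def indexed(z_values):
    """Pattern (c): each processor supplies its own column index Z."""
    return [address_of(z) for z in z_values]


first_half = shifted(1, 3)    # 1003, 1004, ... 1034
second_half = shifted(33, 3)  # 1035, ... 1063, 1000, 1001, 1002
```

In every case the chip issues exactly 32 word accesses, which is why the shifted and indexed reads take no longer than the straightforward one.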

The shift register arrangement of figure 4 can be extended to provide options for input or output of 16 
or 8 bit data as well as 32 bit. Figure 5 shows an example of how multiplexers can configure the shift 
register to deal with either 32 or 16 bit data. The M register is split into four sections of size 16 words of 16 
bits. With each multiplexer selecting its right input the effect is the same as in figure 4. For 16 bit data the
left input of each multiplexer is selected, the left half of the M register associated with each processor is not 
used and is left undefined. The effect of the multiplexing is that the most significant halves of the memory 
words are associated with processors 0 to 15 and the least significant halves with processors 16 to 31. An 
extension of this multiplexing arrangement allows each memory word to hold four 8 bit values. 
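
The association of half words with processors in the 16 bit mode may be sketched in Python as follows. The function names are hypothetical, and the assumption that the k-th memory word serves processors k and 16+k is made for the illustration; the text states only that the most significant halves go to processors 0 to 15 and the least significant halves to processors 16 to 31.

```python
def unpack_16(memory_words):
    """Distribute 16 memory words of 32 bits as 32 values of 16 bits."""
    if len(memory_words) != 16:
        raise ValueError("expected 16 memory words")
    upper = [(w >> 16) & 0xFFFF for w in memory_words]  # processors 0 to 15
    lower = [w & 0xFFFF for w in memory_words]          # processors 16 to 31
    return upper + lower


def pack_16(values):
    """Inverse operation: combine 32 values of 16 bits into 16 memory words."""
    if len(values) != 32:
        raise ValueError("expected 32 values")
    return [((values[k] & 0xFFFF) << 16) | (values[16 + k] & 0xFFFF)
            for k in range(16)]


words = [(k << 16) | (100 + k) for k in range(16)]
per_processor = unpack_16(words)
```

Only 16 memory cycles are then needed to supply all 32 processors, at the cost of halving the word length.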
1. A processor array employing an SIMD architecture, comprising a number of sub-arrays (S1...S4), each
sub-array (S1...S4) comprising n processor elements (PE), each processor element being connected to
local store including on-chip memory, characterised in that each chip is connected to a region of off-chip
memory by an m-bit wide path, where m is an integer greater than 1, selectively configurable as a one-bit
path to or from each of m processor elements or as an m-bit wide path arranged to communicate complete
m-bit words of memory data between the region of off-chip memory and respective processor elements
(PE).

2. A processor array according to claim 1, in which m is equal to n and in which each sub-array is formed
on a separate chip.

3. A processor array according to claim 1 or 2, in which the m-bit wide path is configurable to communicate
words of length m/2 or m/4 between the off-chip memory and respective processor elements (PE).

4. A processor array according to claim 1, 2 or 3, in which each sub-array includes local address generation
means for locally generating an address for each word accessed in the off-chip memory.

5. A processor array according to claim 4, in which the local address generation means include an on-chip
address buffer arranged to store one address for each processor element (PE).

6. A processor array according to claim 5, in which each processor element is arranged to construct a local
index address in an associated register and to transfer the address from the associated register to the
on-chip buffer to write data to a locally-indexed position in the off-chip memory.

7. A processor array according to claim 6, in which the associated register is an operand register (Q)
connected to the arithmetic unit (ALU) of the respective processor element (PE).

8. A processor array according to claim 5, in which each sub-array (S1...S4) includes an on-chip buffer (M)
arranged to hold memory data words associated with respective processor elements (PE) and the operand
register (Q) of each processor element is arranged to hold an associated local memory address.

9. A processor array according to any one of the preceding claims, in which the arithmetic unit is a
byte-wide processor and the operand register (Q) forms part of a multi-byte shift network having a data
output for each byte position, the processing element (PE) further comprising a multiplexer arranged to
communicate data from a selected one of the outputs to the arithmetic unit (ALU).

10. A computing system comprising a processor array according to any one of the preceding claims, and a
host processor connected to the array and arranged to address the processor array as an extension of its
own memory.